    Discovering Attractive Products based on Influence Sets

    Skyline queries have been widely used as a practical tool for multi-criteria decision analysis and for applications involving preference queries. For example, in a typical online retail application, skyline queries can help customers select the most interesting among the available products. Recently, reverse skyline queries have been proposed, highlighting the manufacturer's perspective, i.e., how to determine the expected buyers of a given product. In this work we develop novel algorithms for two important classes of queries involving customer preferences. We first propose a novel algorithm, termed RSA, for answering reverse skyline queries. We then introduce a new type of query, namely the k-Most Attractive Candidates (k-MAC) query. In this type of query, given a set of existing product specifications P, a set of customer preferences C and a set of new candidate products Q, the k-MAC query returns the set of k candidate products from Q that jointly maximize the total number of expected buyers, measured as the cardinality of the union of the individual reverse skyline sets (i.e., influence sets). Applying existing approaches to this problem would require computing the reverse skyline set of each candidate, which is prohibitively expensive for large data sets. We therefore propose a batched algorithm for this problem and compare its performance against a branch-and-bound variant that we devise. Both algorithms use, at their core, variants of our RSA algorithm. Our experimental study using both synthetic and real data sets demonstrates that our proposed algorithms outperform existing or naive solutions to the studied classes of queries.
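
    Once each candidate's influence set is known, the k-MAC objective is an instance of maximum coverage, for which a greedy heuristic is the natural baseline. The sketch below assumes the influence sets have already been computed (e.g., via reverse skyline evaluation); it is illustrative only and is not the paper's RSA, batched, or branch-and-bound algorithm.

        # Hypothetical sketch: pick k candidates whose influence sets
        # (precomputed reverse skylines) jointly cover the most customers.
        def k_mac_greedy(influence_sets, k):
            """influence_sets: dict mapping candidate id -> set of customer ids."""
            chosen, covered = [], set()
            candidates = dict(influence_sets)
            for _ in range(min(k, len(candidates))):
                # Pick the candidate that adds the most uncovered customers.
                best = max(candidates, key=lambda c: len(candidates[c] - covered))
                if not candidates[best] - covered:
                    break  # no remaining candidate attracts new buyers
                chosen.append(best)
                covered |= candidates.pop(best)
            return chosen, covered

        sets = {"q1": {1, 2, 3}, "q2": {3, 4}, "q3": {5}}
        print(k_mac_greedy(sets, 2))  # (['q1', 'q2'], {1, 2, 3, 4})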

    Accurate Data Approximation in Constrained Environments

    Several data reduction techniques have been proposed recently as methods for providing fast and fairly accurate answers to complex queries over large quantities of data. Their use has been widespread, due to the multiple benefits they offer in several constrained environments and applications. Compressed data representations require less space to store and less bandwidth to communicate, and can provide, due to their size, very fast response times to queries. Sensor networks represent a typical constrained environment, due to the limited processing, storage and battery capabilities of the sensor nodes. Large-scale sensor networks require tight data handling and data dissemination techniques. Transmitting a full-resolution data feed from each sensor back to the base station is often prohibitive due to (i) limited bandwidth that may not be sufficient to sustain a continuous feed from all sensors and (ii) increased power consumption due to the wireless multi-hop communication. In order to minimize the volume of the transmitted data, we can apply two well-known data reduction techniques: aggregation and approximation. In this dissertation we propose novel data reduction techniques for the transmission of measurements collected in sensor network environments. We first study the problem of summarizing multi-valued data feeds generated at a single sensor node, a step necessary for the transmission of large amounts of historical information collected at the node. The transmission of these measurements may either be periodic (i.e., when a certain amount of measurements has been collected) or in response to a query from the base station. We then also consider the approximate evaluation of aggregate continuous queries. A continuous query is a query that runs continuously until explicitly terminated by the user. Such queries can be used to obtain a live estimate of some (aggregated) quantity, such as the total number of moving objects detected by the sensors.
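
    As a generic illustration of the approximation side of this trade-off (not the dissertation's actual technique), a sensor can compress its feed into error-bounded piecewise-constant segments before transmission:

        # Illustrative sketch: compress a feed into piecewise-constant
        # segments so every reconstructed value is within max_err of the
        # original reading; only (start_index, value) pairs are transmitted.
        def piecewise_constant(readings, max_err):
            segments = []                               # (start_index, representative)
            start, lo, hi = 0, readings[0], readings[0]
            for i in range(1, len(readings)):
                r = readings[i]
                if max(hi, r) - min(lo, r) > 2 * max_err:
                    segments.append((start, (lo + hi) / 2))  # close the segment
                    start, lo, hi = i, r, r
                else:
                    lo, hi = min(lo, r), max(hi, r)
            segments.append((start, (lo + hi) / 2))
            return segments

        feed = [20.0, 20.5, 19.5, 30.0, 30.5]
        print(piecewise_constant(feed, 0.5))  # [(0, 20.0), (3, 30.25)]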

    Data Reduction Techniques for Sensor Networks

    We are inevitably moving into a realm where small and inexpensive wireless devices will be seamlessly embedded in the physical world and form wireless sensor networks in order to perform complex monitoring and computational tasks. Such networks pose new challenges in data processing and dissemination due to the conflict between (i) the abundance of information that can be collected and processed in a distributed fashion among thousands of nodes and (ii) the limited resources (bandwidth, energy) that such devices possess. In this paper we propose a new data reduction technique that exploits the correlation and redundancy among multiple measurements on the same sensor and achieves a high degree of data reduction while managing to capture even the smallest details of the recorded measurements. The key to our technique is the base signal, a series of values extracted from the real measurements, used for encoding piecewise-linear correlations among the collected data values. We provide efficient algorithms for extracting the base signal features from the data and for encoding the measurements using these features. Our experiments demonstrate that our method far outperforms standard approximation techniques such as Wavelets, Histograms and the Discrete Cosine Transform, on a variety of error metrics and for real datasets from different domains. (UMIACS-TR-2003-80)
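
    A hedged sketch of the encoding idea: approximate a window of measurements as an affine transform of a shared base signal, so that only a pair of coefficients needs to be transmitted per window. The base-signal extraction itself is the paper's contribution and is not reproduced here; the least-squares fit below is a standard building block.

        # Sketch: encode a window of readings as window ~ a * base + b,
        # transmitting only (a, b) instead of the full window.
        def fit_affine(base, window):
            """Least-squares fit of window against the base signal."""
            n = len(base)
            mx, my = sum(base) / n, sum(window) / n
            cov = sum((x - mx) * (y - my) for x, y in zip(base, window))
            var = sum((x - mx) ** 2 for x in base)
            a = cov / var if var else 0.0
            return a, my - a * mx

        base = [0.0, 1.0, 2.0, 3.0]
        window = [5.1, 7.0, 8.9, 11.2]        # roughly 2 * base + 5
        a, b = fit_affine(base, window)
        approx = [a * x + b for x in base]    # reconstruction at the sink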

    Another Outlier Bites the Dust: Computing Meaningful Aggregates in Sensor Networks

    Recent work has demonstrated that readings provided by commodity sensor nodes are often of poor quality. In order to provide a valuable sensory infrastructure for monitoring applications, we first need to devise techniques that can withstand “dirty” and unreliable data during query processing. In this paper we present a novel aggregation framework that detects suspicious measurements by outlier nodes and refrains from incorporating such measurements in the computed aggregate values. We consider different definitions of an outlier node, based on the notion of a user-specified minimum support, and discuss techniques for properly routing messages in the network in order to reduce the bandwidth consumption and the energy drain during the query evaluation. In our experiments using real and synthetic traces we demonstrate that: (i) a straightforward evaluation of a user aggregate query leads to practically meaningless results due to the existence of outliers; (ii) our techniques can detect and eliminate spurious readings without any application-specific knowledge of what constitutes normal behavior; (iii) the identification of outliers, when performed inside the network, significantly reduces bandwidth and energy drain compared to alternative methods that centrally collect and analyze all sensory data; and (iv) we can significantly reduce the cost of the aggregation process by utilizing simple statistics on outlier nodes and reorganizing the collection tree accordingly.
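
    A simplified, centralized sketch of the minimum-support idea (the paper performs the detection in-network, which this sketch does not attempt): a node's reading contributes to the aggregate only if enough other nodes report a similar value.

        # Sketch: a reading is an outlier unless at least min_support other
        # nodes report a value within tol of it; outliers are excluded
        # from the aggregate.
        def robust_average(readings, tol, min_support):
            kept = []
            for i, r in enumerate(readings):
                support = sum(1 for j, s in enumerate(readings)
                              if j != i and abs(r - s) <= tol)
                if support >= min_support:
                    kept.append(r)
            return sum(kept) / len(kept) if kept else None

        readings = [21.0, 21.4, 20.9, 57.3, 21.2]                # one spurious node
        print(robust_average(readings, tol=1.0, min_support=2))  # 21.125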

    Outlier-Aware Data Aggregation in Sensor Networks

    In this paper we discuss a robust aggregation framework that can detect spurious measurements and refrain from incorporating them in the computed aggregate values. Our framework can consider different definitions of an outlier node, based on a specified minimum support. Our experimental evaluation demonstrates the benefits of our approach.

    Practical Private Range Search in Depth

    We consider a data owner that outsources its dataset to an untrusted server. The owner wishes to enable the server to answer range queries on a single attribute, without compromising the privacy of the data and the queries. There are several schemes on “practical” private range search (mainly in database venues) that attempt to strike a trade-off between efficiency and security. Nevertheless, these methods either lack provable security guarantees or permit unacceptable privacy leakages. In this article, we take an interdisciplinary approach, which combines the rigor of security formulations and proofs with efficient data management techniques. We construct a wide set of novel schemes with realistic security/performance trade-offs, adopting the notion of Searchable Symmetric Encryption (SSE), primarily proposed for keyword search. We reduce range search to multi-keyword search using range-covering techniques with tree-like indexes, and formalize the problem as Range Searchable Symmetric Encryption (RSSE). We demonstrate that, given any secure SSE scheme, the challenge boils down to (i) formulating leakages that arise from the index structure and (ii) minimizing false positives incurred by some schemes under heavy data skew. We also explain an important concept in the recent SSE literature, namely locality, and design generic and specialized ways to attribute locality to our RSSE schemes. Moreover, we are the first to devise secure schemes for answering range aggregate queries, such as range sums and range min/max. We analytically detail the superiority of our proposals over prior work and experimentally confirm their practicality.
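
    A hedged sketch of the range-covering reduction (the dyadic layout and keyword labels below are illustrative, not the article's exact construction): each value is indexed under every dyadic interval containing it, and a range query is decomposed into a minimal set of dyadic intervals, each of which becomes a keyword for the underlying SSE scheme.

        # Sketch: dyadic range covering. Keywords are "level:prefix" labels.
        def dyadic_labels(value, bits):
            """All dyadic intervals (as keywords) covering value in a 2^bits domain."""
            return [f"{level}:{value >> (bits - level)}" for level in range(bits + 1)]

        def cover(lo, hi, bits, level=0, prefix=0):
            """Minimal dyadic cover of [lo, hi] within [0, 2^bits - 1]."""
            left = prefix << (bits - level)
            right = left + (1 << (bits - level)) - 1
            if lo <= left and right <= hi:
                return [f"{level}:{prefix}"]   # node fully inside the range
            if right < lo or hi < left:
                return []                      # node disjoint from the range
            return (cover(lo, hi, bits, level + 1, 2 * prefix) +
                    cover(lo, hi, bits, level + 1, 2 * prefix + 1))

        # Index time: value 5 in a 3-bit domain is stored under these keywords.
        print(dyadic_labels(5, 3))   # ['0:0', '1:1', '2:2', '3:5']
        # Query time: [2, 6] decomposes into a few dyadic keywords.
        print(cover(2, 6, 3))        # ['2:1', '2:2', '3:6']

    A value matches a query exactly when one of its labels appears in the query's cover, so the plaintext range semantics carry over to multi-keyword search.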

    DCC&U: An Extended Digital Curation Lifecycle Model

    The proliferation of Web, database and social networking technologies has enabled us to produce, publish and exchange digital assets at an enormous rate. This vast amount of information that is either digitized or born-digital needs to be collected, organized and preserved in a way that ensures that our digital assets and the information they carry remain available for future use. Digital curation has emerged as a new interdisciplinary practice that seeks to set guidelines for disciplined management of information. In this paper we review two recent models for digital curation introduced by the Digital Curation Centre (DCC) and the Digital Curation Unit (DCU) of the Athena Research Centre. We then propose a fusion of the two models that highlights the need to extend the digital curation lifecycle by adding (a) provisions for the registration of usage experience, (b) a stage for knowledge enhancement and (c) controlled vocabularies used by convention to denote concepts, properties and relations. The objective of the proposed extensions is twofold: (i) to provide a more complete lifecycle model for the digital curation domain; and (ii) to provide a stimulus for a broader discussion on the research agenda.

    A Fast Approximation Scheme for Probabilistic Wavelet Synopses

    Several studies have demonstrated the effectiveness of Haar wavelets in reducing large amounts of data down to compact wavelet synopses that can be used to obtain fast, accurate approximate query answers. While Haar wavelets were originally designed for minimizing the overall root-mean-squared (i.e., L2-norm) error in the data approximation, the recently-proposed idea of probabilistic wavelet synopses also enables their use in minimizing other error metrics, such as the relative error in individual data-value reconstruction, which is arguably the most important for approximate query processing. Known construction algorithms for probabilistic wavelet synopses employ probabilistic schemes for coefficient thresholding that are based on optimal Dynamic-Programming (DP) formulations over the error-tree structure for Haar coefficients. Unfortunately, these (exact) schemes can scale quite poorly for large data-domain and synopsis sizes. To address this shortcoming, in this paper, we introduce a novel, fast approximation scheme for building probabilistic wavelet synopses over large data sets. Our algorithm's running time is near-linear in the size of the data domain (even for very large synopsis sizes) and proportional to 1/ε, where ε is the desired approximation guarantee. The key technical idea in our approximation scheme is to make the exact DP formulations for probabilistic thresholding much “sparser”, while ensuring a maximum relative degradation of ε on the quality of the approximate synopsis, i.e., the desired approximation error metric. Extensive experimental results over synthetic and real-life data clearly demonstrate the benefits of our proposed techniques.
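
    For background, the sketch below shows the plain Haar decomposition and a simple deterministic thresholding step on which wavelet synopses are built; the paper's actual contribution, probabilistic thresholding via DP over the error tree, is not reproduced here.

        # Background sketch: unnormalized Haar transform plus naive
        # thresholding (keep the largest coefficients as the synopsis).
        def haar_decompose(data):
            """Haar transform of an array whose length is a power of two."""
            coeffs, avgs = [], list(data)
            while len(avgs) > 1:
                pairs = [(avgs[i], avgs[i + 1]) for i in range(0, len(avgs), 2)]
                coeffs = [(a - b) / 2 for a, b in pairs] + coeffs  # details
                avgs = [(a + b) / 2 for a, b in pairs]             # averages
            return avgs + coeffs   # [overall average, coarse-to-fine details]

        def synopsis(coeffs, budget):
            """Retain the budget largest-magnitude coefficients."""
            top = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]),
                         reverse=True)[:budget]
            return {i: coeffs[i] for i in sorted(top)}

        c = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
        print(c)               # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
        print(synopsis(c, 3))  # {0: 2.75, 1: -1.25, 5: -1.0}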